AD-A050 629  AIR FORCE HUMAN RESOURCES LAB BROOKS AFB TEX  F/G 5/9

SIMULATED AND EMPIRICAL STUDIES OF FLEXILEVEL TESTING IN AIR FO--ETC(U)
SEP 77  D A HARRIS, R J PENNELL



ADA050829 


AFHRL-TR-77-61 


AIR  FORCE  SYSTEMS  COMMAND 
BROOKS AIR FORCE BASE, TEXAS 78235


NOTICE 


When U.S. Government drawings, specifications, or other data are used
for any purpose other than a definitely related Government
procurement operation, the Government thereby incurs no
responsibility nor any obligation whatsoever, and the fact that the
Government may have formulated, furnished, or in any way supplied
the said drawings, specifications, or other data is not to be regarded by
implication or otherwise, as in any manner licensing the holder or any
other person or corporation, or conveying any rights or permission to
manufacture, use, or sell any patented invention that may in any way
be related thereto.

This final report was submitted by Technical Training Division, Air
Force Human Resources Laboratory, Lowry Air Force Base, Colorado
80230, under project 1121, with HQ Air Force Human Resources
Laboratory (AFSC), Brooks Air Force Base, Texas 78235.

This report has been reviewed and cleared for open publication and/or
public release by the appropriate Office of Information (OI) in
accordance with AFR 190-17 and DoDD 5230.9. There is no objection
to unlimited distribution of this report to the public at large, or by
DDC to the National Technical Information Service (NTIS).

This technical report has been reviewed and is approved for publication.

MARTY R. ROCKWAY, Technical Director
Technical Training Division

DAN D. FULGHAM, Colonel, USAF
Commander


Unclassified

SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)


REPORT  DOCUMENTATION  PAGE 


SIMULATED AND EMPIRICAL STUDIES OF FLEXILEVEL TESTING
IN AIR FORCE TECHNICAL TRAINING COURSES


9. PERFORMING ORGANIZATION NAME AND ADDRESS

Technical Training Division
Air Force Human Resources Laboratory
Lowry Air Force Base, Colorado 80230

Brooks Air Force Base, Texas 78235


READ INSTRUCTIONS
BEFORE COMPLETING FORM

3. RECIPIENT'S CATALOG NUMBER

PERFORMING ORG. REPORT NUMBER

CONTRACT OR GRANT NUMBER(s)

MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office)

15a. DECLASSIFICATION/DOWNGRADING
SCHEDULE


16. DISTRIBUTION STATEMENT (of this Report)


Approved  for  public  release;  distribution  unlimited. 


17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report)


19. KEY WORDS (Continue on reverse side if necessary and identify by block number)

adaptive  testing 
computerized  testing 
flexilevel  strategy 
simulation studies
technical  training 


20. ABSTRACT (Continue on reverse side if necessary and identify by block number)

This study used a series of simulations to answer questions raised by empirical studies. The first study showed
that for reasonably high entry points, parameters estimated from paper-and-pencil test protocols cross-validated
remarkably well to groups actually tested at a computer terminal. This suggested that feasibility studies; i.e., running
actual subjects, may not be called for. The second study showed that the proportion correct during flexilevel testing
was a sensitive measure of student performance. It was also concluded that the modest time savings (12 to 15
percent) was due to the parameters used to implement flexilevel testing. Study III showed that a 50 percent savings
in items, and, potentially, a large savings in test time could be realized through the implementation of alternate
flexilevel strategies. In summary, the overall conclusion from the three studies was that flexilevel testing, with




Objective
Method
Results and Discussion
Conclusions

III. Study II
Objective
Method
Results
Conclusions

IV. Study III
Objective
Method
Results and Discussion
Conclusions and Recommendations

References


LIST  OF  TABLES 


Table

1 Item Difficulties, Group C
2 Regression Weights and Validities, Group C
3 Validities by Entry Point
4 Percent Items Required to Terminate Testing
5 Item Error in Predicting Total Score
6 Classification Error by Entry Point
7 Items Comprising Scales and Difficulties for the Block II Test
8 Items Comprising Scales and Difficulties for the Block IV Test
9 Summary Statistics for Dependent Measures
10 Regression Equations and Classification Analysis
11 Time (in Minutes) to Complete Scales
12 Items Comprising Scales and Difficulties
13 Simulation Results for Block II
14 Simulation Results for Block IV
15 Simulation Results: SPC = 2, SS = 3, el = 3
16 Regression Analyses for % Items Saved
17 Regression Analyses for Predicting Class (S)
18 Regression Analyses for Predicting Class (F)
19 Regression Analyses Predicting a^
20 Regression Analyses Predicting bj
21 Regression Analyses Predicting ap
22 Regression Analyses for bp
23 Multiple Correlations Obtained in Cross-Validation Study


SIMULATED  AND  EMPIRICAL  STUDIES  OF  FLEXILEVEL  TESTING 
IN  AIR  FORCE  TECHNICAL  TRAINING  COURSES 


I. INTRODUCTION

In an environment such as is offered by the Advanced Instructional System (AIS), the potential
benefits derivable from adaptive testing become a practical reality. The AIS is an advanced development
program to develop a computer based educational and training system for the Air Force. The heart of the
system is a CDC Cyber-70 Computer which currently manages the training process for four courses at
Lowry Technical Training Center, Lowry AFB, Colorado, through two types of terminals. The type A
terminal  is  an  interactive  plasma  display  terminal  with  graphic  capabilities,  while  the  type  B management 
terminal  has  test  form  reading  and  scoring  capabilities  along  with  a line  printer  for  issuing  student 
prescriptions.  The  system  is  designed  to  manage  the  individualized  instructional  process  of  a large  number 
of  students  who  spend  approximately  33  percent  of  their  time  in  a testing  mode.  Thus,  with  a large  student 
flow  through  AIS  courses  requiring  extensive  testing,  considerable  payoff  in  terms  of  reduced  training  time 
is  potentially  available  from  procedures  which  reduce  testing  times  without  compromising  instructional 
effectiveness. 

Adaptive testing has been investigated under a variety of rubrics such as branched testing, response
contingent testing, sequential testing, and tailored testing. We shall use the general term adaptive testing to
characterize  any  attempt  to  match  test  items  to  examinees  based  on  a response  history,  with  the  goal  of 
reducing  testing  time,  or  obtaining  more  valid  and/or  more  reliable  ability  estimates. 

Background 

Realizing the potential of adaptive testing in a system such as the AIS, the Air Force Human
Resources Laboratory, Brooks AFB, Texas, initiated a multi-phase research study beginning with the
identification of a suitable algorithm to drive an adaptive testing program. During Phase I, the flexilevel
approach of Lord (1971a, 1971b) was identified as the tentative algorithm (Hansen, Johnson, Fagan, Tam,
& Dick, 1974). Flexilevel testing has a number of advantages over other methods of adaptive testing.
Namely, it is easily implemented, it does not require a large item pool, and theoretically it requires only
(n+1)/2 items (where n is the number of items in the total test pool) to test each examinee. For example, a
25 item test would require only 13 items to test each examinee. The flexilevel test (Lord, 1971a, 1971b)
first administers the item of median difficulty (difficulty levels ascertained from pretesting). If an item is
answered incorrectly, the next easier unanswered item is given. If an item is answered correctly, the next
harder unanswered item is given. An examinee continues testing until he has answered (n+1)/2 items.
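
The branching rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used in the studies: the function name `flexilevel_items` is hypothetical, the examinee's full set of responses is assumed to be known in advance (as in a simulation on complete protocols), items are assumed to be sorted from easiest to hardest, and the pool size is assumed to be odd.

```python
def flexilevel_items(responses):
    """Return the item indices administered under the (n+1)/2 flexilevel rule.

    `responses` holds 1 (correct) or 0 (incorrect) for every item in the pool,
    ordered from easiest (index 0) to hardest (last index). Assumption: the
    whole protocol is known in advance, as in a paper-and-pencil simulation.
    """
    n = len(responses)
    if n % 2 == 0:
        raise ValueError("this sketch assumes an odd number of items")
    quota = (n + 1) // 2          # number of items each examinee takes
    middle = n // 2               # item of median difficulty
    easier, harder = middle - 1, middle + 1
    current = middle
    administered = []
    while len(administered) < quota:
        administered.append(current)
        if responses[current]:
            # Correct answer: move to the next harder unanswered item.
            current, harder = harder, harder + 1
        else:
            # Incorrect answer: move to the next easier unanswered item.
            current, easier = easier, easier - 1
    return administered


print(len(flexilevel_items([1] * 25)))     # 13
print(flexilevel_items([1, 1, 0, 1, 0]))   # [2, 1, 3]
```

With 25 items the quota is 13, matching the example in the text; the second call shows the up-and-down path for a short five-item pool.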

Phase  II  of  the  research  consisted  of  experimental  studies  conducted  in  the  Inventory  Management 
(IM)  and  Precision  Measuring  Equipment  (PME)  courses.  The  Block  II  test  of  the  IM  course  was  used  for 
the  implementation  of  Study  I (Hansen,  Harris,  & Ross,  1977a)  while  the  Block  II  and  Block  IV  tests  of 
PME  were  used  in  Study  II  (Hansen,  Harris,  & Ross,  1977b).  The  purpose  of  Study  I was  to  validate  the 
flexilevel adaptive testing paradigm with the primary goal of reducing test time. Each student was
individually entered in the test, given the flexilevel adaptive test and then all remaining items. This design
was  employed  in  order  to  fulfill  the  operational  requirements  of  the  training  system.  The  results  revealed  an 
extremely  high  part-whole  correlation  (r  = .94)  between  the  flexilevel  and  total  test  scores.  The  flexilevel 
test,  however,  required  39.5  percent  fewer  items  with  a concomitant  time  savings  of  18.4  percent. 

As mentioned, Study II was performed in Blocks II and IV of the PME course. A task analysis was
used to group items into five scales and to construct a hierarchy of scales within the test. The intention was
to explore the feasibility of adaptively testing both within and across scales. Test validity analyses yielded
high part-whole correlations between adaptive test and total test scores (r's = .95). In addition, the time
savings associated with adaptive testing approximated 30 percent for both blocks (Hansen, Harris, & Ross,
1977b). Following completion of the two empirical studies, several questions concerning the efficacy of
adaptive testing remained to be answered.

The  purpose  of  this  report  is  to  present  the  results  of  three  simulation  studies  designed  to  answer 
questions  raised  by  the  empirical  studies.  The  first  simulation  study  was  designed  to  evaluate  the  need  for 
conducting  empirical  studies.  The  second  simulation  study  was  designed  to  reconstruct  the  testing  situation 
and  analyze  the  data  for  different  purposes.  And  finally,  the  third  study  was  designed  to  simulate,  using 
Study  II  test  protocols,  the  effects  of  adaptive  movement  across  scales  as  well  as  within  scales. 


II. STUDY I


Objective 

The  thrust  of  Study  I was  to  explore  the  kinds  of  conclusions  which  might  be  made  by  simulating 
flexilevel  testing  on  paper-and-pencil  protocols  and  comparing  the  results  (i.e.,  estimated  parameters)  to 
those  data  actually  collected  on  the  computer  terminal  (Phase  II).  The  intent  was  to  evaluate  the  extent  to 
which the actual implementation and testing of the model on a computer terminal can be avoided.

A number of simulation studies of adaptive testing have been conducted; among these are Bryson
(1972); Cleary, Linn, and Rock (1968a, 1968b); Linn, Rock, and Cleary (1970); and Patterson (1962).
These studies have largely been concerned with ascertaining the potential benefits derivable from an
adaptive testing paradigm, rather than extrapolating simulated results to actual adaptive data as this study
did. Basically, the question posed by the present study was, "Must one actually conduct an empirical study
such as that conducted during Phase II to ascertain adaptive testing feasibility?" And furthermore, "To
what extent do simulated results parallel results under actual PLATO testing conditions?"


Method

A sample of 186 paper-and-pencil protocols was obtained from Inventory Management/Materiel
Facilities (IM/MF) Block II. The test was composed of the same items used in the Phase II experiment. The
sample was divided into two equal parts; i.e., a calibration (C) sample and a validation (V) sample. The C
sample was used to estimate parameters necessary to implement the flexilevel testing algorithm. These
parameters were then validated on the V sample in order to evaluate the stability of various dependent
measures. The parameters estimated were the item difficulties, which imply the item ordering for flexilevel
presentation, and the regression parameters for converting the flexilevel score into an estimated total score.
Admittedly, the flexilevel score could have been used to make the necessary pass/fail decisions required in a
criterion-referenced testing situation such as found in Air Force technical training; however, for two reasons
it was desirable to translate back to the total score metric (percent correct). First, this is the metric
traditionally used to assign scores, and second, the extent to which the flexilevel score reproduces the total
score is a prime dependent measure in evaluating the feasibility of flexilevel testing. The flexilevel score was
derived as follows: Let A index the set of items taken under flexilevel testing and let $d_i$, $i \in A$, represent the
difficulty of the i-th item expressed as percent of the C sample answering correctly. Further, let

$$\delta_i = \begin{cases} 1 & \text{if item } i \text{ answered correctly,} \\ -1 & \text{if item } i \text{ answered incorrectly.} \end{cases}$$

Then, the flexilevel score for the j-th examinee was defined as

$$F_j = \sum_{i \in A} \delta_i d_i \qquad (1)$$


Stated more simply, $F_j$ was the sum of the difficulties of items answered correctly minus the sum of the
difficulties of items answered incorrectly.
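
As a worked illustration of Equation 1, the following Python sketch adds the difficulties of items answered correctly and subtracts those of items answered incorrectly. The function name `flexilevel_score` and the response pattern are hypothetical; the three difficulty values are the Table 1 entries for items 1, 8, and 10.

```python
def flexilevel_score(responses, difficulty):
    """Equation 1: sum of delta_i * d_i over the administered items.

    `responses` maps item number -> True (correct) or False (incorrect) for
    the items actually administered; `difficulty` maps item number -> d_i.
    """
    total = 0.0
    for item, correct in responses.items():
        delta = 1 if correct else -1
        total += delta * difficulty[item]
    return total


# Difficulties for items 1, 8 and 10 as listed in Table 1.
difficulty = {1: 0.968, 8: 0.819, 10: 0.638}
# Hypothetical response pattern, for illustration only.
responses = {8: True, 10: False, 1: True}
print(flexilevel_score(responses, difficulty))  # approximately 1.149
```

Here the score is .819 - .638 + .968, or about 1.149.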


Since the total score, say $X_j$, was available as the sum of correct responses divided by the number of
items in the item pool (n = 25), we used the usual regression equation,

$$\hat{X}_j = a + bF_j \qquad (2)$$

to estimate the total score and the associated error $|X_j - \hat{X}_j|$. It should be noted that the usual flexilevel rule
of administering (n+1)/2 items to each examinee was departed from in both the Phase II study and here.
That is, testing for a particular examinee was terminated if he was to take a harder item, but had
already answered all of the harder items, or if he was to take an easier item, but had already taken all of
these. This decision rule was used because one of the dependent measures was the number of items required
to terminate testing as a function of entering examinees at varying locations on the item hierarchy.
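
The modified stopping rule can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `flexilevel_until_exhausted` is hypothetical, items are assumed to be indexed from easiest (0) to hardest, and the zero-based `entry` argument is an assumption of this sketch rather than the report's entry-point numbering.

```python
def flexilevel_until_exhausted(responses, entry):
    """Administer items until the required harder or easier item does not exist.

    `responses` holds 1/0 for every item, ordered easiest to hardest
    (an assumption of this sketch); `entry` is the zero-based starting index.
    """
    easier, harder = entry - 1, entry + 1
    current = entry
    administered = []
    while 0 <= current < len(responses):
        administered.append(current)
        if responses[current]:
            # Correct: a harder item is required next.
            current, harder = harder, harder + 1
        else:
            # Incorrect: an easier item is required next.
            current, easier = easier, easier - 1
    return administered


path = flexilevel_until_exhausted([1, 1, 0, 1, 1], entry=2)
print(path)                      # [2, 1, 3, 4]
print(100 * len(path) / 5)       # percent of the pool administered: 80.0
```

Testing stops as soon as the next item called for lies beyond either end of the hierarchy, so the number of items administered depends on the entry point and on the examinee's responses.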

The dependent variable analyzed besides those mentioned above (viz, effect of item hierarchy variable
entry and error in reproducing total score) was classification error. Here we examined, for a range of
criterion levels, the error rate using $\hat{X}_j$ to classify students as failing or passing relative to their known
classification based on $X_j$.

In addition to the C and V samples, a third sample (N = 100) was obtained by randomly selecting test
protocols of students who had gone through Phase II testing on the computer. This was possible since at the
completion of each flexilevel session (using the same stopping rule described above) all items on the 25 item
instrument which had not been administered were given. Thus, complete item protocols were available on
this cross-validation (CV) group.

One intention of the Phase II study was to explore the utility of adaptively entering examinees into
the item hierarchy. The entry point was calculated using three aptitude tests taken before the students
entered training. It was thought that adaptive entry might further reduce testing time over savings
attributable to taking only (n+1)/2 items. Unfortunately the CV sample was obtained when monitors were
having difficulty obtaining the aptitude scores. Therefore, the majority of the sample was entered at the
(n+1)/2-th item.

The comparison of the flexilevel results in the CV group using the parameters estimated in the C
group  explored  whether  a feasibility  study  such  as  Phase  II  needs  to  be  conducted.  Theoretically,  the  only 
difference  between  the  CV  and  C groups  was  the  use  of  a computer  terminal  to  administer  the  test.  This 
assumes  item  independence  in  the  sense  that  taking  items  in  a different  order  would  not  affect  the  test 
score. 


Results and Discussion

The item difficulties for the 25 items under study are presented in Table 1. The mean item difficulty,
an estimate of the mean test score, was .804. Typically, criterion-referenced test items tend to be quite
easy; however, one of the items is exceptionally hard (item 6). Eliminating item 6 raises the mean to about
.84, which indicates that about 16 percent of the sample misses an average item. The difficulties in Table 1
implied the ordering of the items for the simulated flexilevel testing; equal item difficulties implied an
arbitrary ordering.

Next, the regression parameters for Equation 2 were estimated. Regression estimates for entering the
item hierarchy at items 3, 5, 7, 9, 11, 13, and 15 were calculated. These estimates are presented in Table 2
along with the correlation (validity) between X, the total score, and F, the flexilevel score (see Equation 1).
The lower down (easier items) on the item hierarchy students were entered, the more items were required
to terminate the flexilevel algorithm. This was vividly displayed by the trend of the regression weights. That
is, increasing the entry point reduced the constant term, a, and increased the importance of the b term
corresponding to the flexilevel score. The validities beginning at entry point 7 were quite good, indicating a
high degree of accuracy in predicting total score. However, the cross-validated validities were of more interest.
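
The estimation of a and b in Equation 2, together with the validity coefficient, can be illustrated with an ordinary least-squares fit. The sketch below is an assumption about how such a fit might be coded, not the authors' program; the function name `fit_total_score` is hypothetical and the data are made up purely to show the mechanics.

```python
from math import sqrt


def fit_total_score(F, X):
    """Least-squares fit of X = a + b*F; returns (a, b, r).

    `F` is a list of flexilevel scores and `X` the matching total scores.
    `r` is the Pearson correlation (the "validity" reported in Table 2).
    """
    n = len(F)
    mean_f = sum(F) / n
    mean_x = sum(X) / n
    s_ff = sum((f - mean_f) ** 2 for f in F)
    s_xx = sum((x - mean_x) ** 2 for x in X)
    s_fx = sum((f - mean_f) * (x - mean_x) for f, x in zip(F, X))
    b = s_fx / s_ff
    a = mean_x - b * mean_f
    r = s_fx / sqrt(s_ff * s_xx)
    return a, b, r


# Illustrative data only (not taken from the report).
a, b, r = fit_total_score([0.0, 1.0, 2.0, 3.0], [0.5, 0.6, 0.7, 0.8])
predicted = [a + b * f for f in [0.0, 1.0, 2.0, 3.0]]
print(a, b, r)
```

Weights fitted this way on a calibration sample can then be applied unchanged to the flexilevel scores of another sample, which is the cross-validation step discussed in the text.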

Table 3 presents the V and CV group validities along with the C group for comparison. It should be
noted that the estimated total score was computed using the weights developed in the C group. The
validities for the V group were strikingly high, in some cases higher than the C group. This indicated that
the error in utilizing "nonoptimal" regression weights and item difficulties was essentially non-existent.
Some shrinkage was encountered in the CV group. However, this shrinkage all but evaporated after entry
point 11. This indicated that parameters developed on paper-and-pencil protocols cross-validate to results
obtained by use of computer terminals for high entry levels.


Table 1. Item Difficulties, Group C

Item    Difficulty
 1      .968
 2      [illegible]
 3      [illegible]
 4      [illegible]
 5      [illegible]
 6      [illegible]
 7      .670
 8      .819
 9      .819
10      .638
11      .915
12      .777
13      .777
14      .862
15      .894
16      .840
17      .840
18      .840
19      .723
20      .862
21      .691
22      .819
23      .755
24      .926
25      .777


Table 2. Regression Weights and Validities, Group C

Entry Point    a       b              Validity
 3             .714    [illegible]    .654
 5             .656    .509           .773
 7             .617    .560           .847
 9             .578    .612           .926
11             .555    .631           .952
13             .524    .661           .972
15             .503    .671           .981


Table 3. Validities by Entry Point

               Group
Entry Point    C      V      CV
 3             65*    75     60
 5             77     78     69
 7             85     87     79
 9             93     93     83
11             95     95     93
13             97     97     96
15             98     98     98

*Decimal points omitted.


Since the items used to construct the flexilevel score were also used (together with additional items)
to compute the total score, the validities reported in Table 3 are inflated to some extent. The total score
was computed by summing 1's and 0's corresponding to a correct or incorrect item, whereas the flexilevel
score was computed by summing weighted item difficulties. Doubtless, the weighted item difficulties have
some built-in minimum correlation with the 1-0 protocol.

Table 4 presents the average percent of items needed to terminate the flexilevel algorithm as a
function of entry point. For example, when entering at item 5, all three groups required an average of 30
percent of the total 25 items, or 7.5 items, to terminate the algorithm. The differences between the C
sample and the V and CV samples presumably reflect an increase in test items required by using nonoptimal
difficulties, and thus a nonoptimal item hierarchy for flexilevel branching. However, this effect was
decidedly minimal.

Table 5 presents, in terms of number of items, the average absolute error made in predicting total
score. For example, entering at the 11-th item, the estimated total score ($\hat{X}_j$) differs by an average of .9
item from the known total score ($X_j$). Similar to Table 3, these data show comparable results across the
three groups entering at item 11 and above.


Table 4. Percent Items Required to Terminate Testing

               Group
Entry Point    V      C      CV
 3             20     20     19
 5             30     30     30
 7             41     40     41
 9             52     50     52
11             62     60     62
13             70     69     72
15             78     77     80


Table 5. Item Error in Predicting Total Score

               Group
Entry Point    V      C      CV
 3             2.0    1.7    1.9
 5             1.7    1.5    1.8
 7             1.5    1.3    1.4
 9             1.2    1.0    1.3
11              .9     .9     .9
13              .7     .6     .7
15              .5     .6     .5


Table 6 shows the average percentage of error of classification across various criterion levels. For
example, for a criterion of .70, if $X_j > .70$ and $\hat{X}_j > .70$, or if $X_j < .70$ and $\hat{X}_j < .70$, the j-th student is properly
classified. However, if $X_j > .70$ and $\hat{X}_j < .70$, or if $X_j < .70$ and $\hat{X}_j > .70$, there has been a classification error
relative to the criterion of 70 percent. The percent of these errors averaged over criterion levels .40, .44, ..., .96
is the statistic presented in Table 6. Entering at item 3, the cross-validated percentage of errors is about
11.5 percent, which doubtless would be unacceptably high to most course designers. On the other hand,
errors of 6 or 7 percent might be acceptable when balanced against the decrease in overall training time.
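
The classification-error statistic can be sketched as below. The function names are hypothetical, the scores are illustrative rather than drawn from the study, and a score exactly equal to the criterion is treated as not exceeding it, which is an assumption since the text does not say how ties were handled.

```python
def misclassification_rate(actual, predicted, criterion):
    """Fraction of students whose pass/fail status differs between X and X-hat."""
    errors = sum(
        (x > criterion) != (x_hat > criterion)
        for x, x_hat in zip(actual, predicted)
    )
    return errors / len(actual)


def average_misclassification(actual, predicted, criteria):
    """Average the misclassification rate over several criterion levels."""
    rates = [misclassification_rate(actual, predicted, c) for c in criteria]
    return sum(rates) / len(rates)


# Illustrative total scores and estimated total scores (not from the report).
actual = [0.60, 0.72, 0.80, 0.68]
predicted = [0.64, 0.69, 0.84, 0.66]
print(misclassification_rate(actual, predicted, 0.70))                # 0.25
print(average_misclassification(actual, predicted, [0.60, 0.70]))     # 0.25
```

In the first call only the second student is misclassified (actual .72 passes, estimated .69 fails), giving one error in four.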


Table 6. Classification Error by Entry Point

               Group*
Entry Point    V      C      CV
 3             14     11     12
 5             11     10     11
 7             10      8      9
 9              8      7      9
11              6      6      7
13              5      5      6
15              4      4      5

*Percent misclassified.


Conclusions 

Making any decision regarding the implementation of adaptive testing involves a trade-off between
potential gains and potential losses. It has been shown that fairly substantial decreases in required test items
are obtainable with very accurate estimation of total score (an empirical question remaining is whether
there is a decrease in testing time associated with the decrease in test items). The trade-off is relative to the
error of categorizing an examinee incorrectly as passing or failing based on a flexilevel score. The above
results indicate that this type of error ranges from about 5 to 12 percent. It should be noted, however,
that the criterion used to gauge this error was the total score; this is a far from ideal criterion. What is
needed, of course, is the "true score;" i.e., the unknown indicator of whether or not a student has accomplished
the behavioral objective, imperfectly measured by the total test score. Lacking such an indicator we
have used the total score. However, there is no reason why the flexilevel test could not be making the more
valid decisions relative to the "true score." Indeed, this is one of the theoretical benefits attributable to
adaptive testing.

The foregoing data have indicated that for reasonably high entry points, parameters estimated from
paper-and-pencil  test  protocols  cross-validate  remarkably  well  to  groups  actually  tested  at  a computer 
terminal  using  a flexilevel  algorithm.  This  suggests  that  feasibility  studies,  running  actual  subjects,  may  not 
be  called  for.  Rather,  simulated  results  based  on  paper-and-pencil  protocols  may  lead  to  a quick  decision  as 
to  whether  to  implement  adaptive  testing. 


III. STUDY II

Objective 

The  objectives  of  Study  II  were  to  summarize  the  data  collected  under  the  Phase  II  contract  effort, 
and  to  offer  some  conclusions  concerning  the  efficacy  of  flexilevel  testing  in  an  on-going  training 
environment. The analysis was, of course, constrained by the manner in which the study was implemented;
however,  the  present  analysis  takes  a somewhat  different  cut  at  the  data. 

Method 

A sample was obtained of 133 PME students who were block tested on a computer terminal. Of those 133
protocols, 61 were Block II tests and 72 were Block IV tests. Both block tests contained 40 items; however,
the  subject  matter  covered  by  the  tests  was  quite  different. 

A task  analysis  was  done  in  order  to  construct  a hierarchical  structure  for  each  test.  The  task  analysis 
grouped  items  into  five  relatively  homogeneous  scales  according  to  item  content.  The  scales  were  then 
placed  in  a hierarchical  structure  based  on  the  relationships  defined  by  the  task  analysis. 

All  students  entered  the  test  at  the  median  difficulty  item  of  the  first  scale  and  were  presented  items 
based  on  the  flexilevel  algorithm  described  in  Study  I.  After  completing  the  flexilevel  portion  of  each  scale, 
the  students  were  given  the  remainder  of  the  items  and  then  started  at  the  median  difficulty  item  in  the 
next  scale.  This  procedure  was  continued  until  all  five  scales  were  completed. 

Results

The items comprising the scales along with their difficulties are presented in Table 7 and Table 8. As
in Study I, the items were quite easy, the scale mean difficulties ranging from .81 to .94 in Block II and
from .81 to .93 in Block IV. Notice, also, that the average difficulty of a scale does not necessarily
correspond to the position of the scale within the hierarchy. That is, the scales were ranked in the hierarchy
not by average difficulty but rather, by content.

The variables of interest were the proportion of items answered correctly during the flexilevel portion
of the test ($S_j$) and the flexilevel score ($F_j$), the latter being modified slightly from Study I. Namely, let R
be the set of items the student got right, W the set wrong during flexilevel testing, and $P_i$ the difficulty of
the i-th item as obtained from Tables 7 and 8. Then:

$$F_j = \sum_{i \in R} (1 - P_i) - \sum_{k \in W} P_k \qquad (3)$$

defines the flexilevel score for the j-th student. As well, we shall be interested in the percent of items saved,
the amount of time saved relative to taking the full 40 item test, and the remainder score, i.e., the score
achieved on those items not taken during the flexilevel portion.
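
A small worked example may help in reading Equation 3. The function names below are hypothetical and the response pattern is invented for illustration; the three difficulties are the Table 7 values for items 26, 25, and 28 of the Block II test.

```python
def proportion_correct(right, wrong):
    """S_j: proportion of flexilevel items answered correctly."""
    return len(right) / (len(right) + len(wrong))


def flexilevel_score_eq3(right, wrong, p):
    """Equation 3: sum of (1 - P_i) over items right minus sum of P_k over items wrong."""
    return sum(1 - p[i] for i in right) - sum(p[k] for k in wrong)


# Difficulties for three Block II items as listed in Table 7.
p = {26: 0.89, 25: 0.88, 28: 0.70}
# Hypothetical pattern: items 26 and 25 right, item 28 wrong.
right, wrong = [26, 25], [28]
print(proportion_correct(right, wrong))       # two of three items correct
print(flexilevel_score_eq3(right, wrong, p))  # approximately -0.47
```

Passing the two easy items adds only .11 and .12, while missing item 28 subtracts .70, so the score is about -.47 even though two of three items were answered correctly.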




Table 7. Items Comprising Scales and Difficulties for the Block II Test
(Calibration Sample N = 105)

            Scale 1            Scale 2            Scale 3            Scale 4            Scale 5
            Item  Difficulty   Item  Difficulty   Item  Difficulty   Item  Difficulty   Item  Difficulty
            11    .97          24    .97          15    .98          26    .89          34    .95
            10    .96          14    .96          29    .94          25    .88          31    .94
             6    .96           1    .95          21    .94          39    .88          36    .93
             6    .95           5    .90          16    .93          27    .81          37    .90
            12    .94           3    .90          20    .92          40    .81          32    .85
             7    .92           2    .75          17    .89          28    .70          38    .84
             8    .86          23    .74          18    .87                             35    .77
                               13    .72          19    .85                             33    .63
                                4    .70          22    .84                             30    .51
Mean Diff         .94                .84                .91                .83                .81


Table 8. Items Comprising Scales and Difficulties for the Block IV Test
(Calibration Sample N = 113)

            Scale 1            Scale 2            Scale 3            Scale 4            Scale 5
            Item  Difficulty   Item  Difficulty   Item  Difficulty   Item  Difficulty   Item  Difficulty
            15    1.00          1    .96          29    1.00         31    .98          38    .96
            16    1.00         10    .90          26    .99          39    .88           4    .95
            18    1.00         11    .88          24    .98          37    .88          14    .85
             8    .96           5    .88          23    .97          34    .87          13    .84
            21    .96          22    .82          25    .94          32    .82          28    .81
             2    .92          35    .62          27    .83          33    .70          17    .70
[Remaining entries in source order, scale columns not recoverable:
 19 .86; 7 .61; 30 .72; 36 .69; 12 .81; 40 .57; 20 .82; 3 .67; 6 .58; 9 .58]
Mean Diff         .93                .81                .92                .80                .85


Table 9 contains the means, standard deviations and correlations with total score for $S_j$, $F_j$, percent
items saved, and remainder scores for Blocks II and IV. Both $S_j$ and $F_j$ were almost perfectly related to the
total score as evidenced by the correlation of .98. This indicated that after the student had taken about 70
percent of the items in Block II and 75 percent of the items in Block IV, the prediction of his total score
from $S_j$ or $F_j$ was almost perfect.

Table 9.  Summary Statistics for Dependent Measures

                                    Block II                          Block IV
                                          Correlation                       Correlation
                                          with                              with
Measure                    Mean    SD     Total Score        Mean    SD     Total Score

Total Score                 .85    .39                        .82    .39
Sj (Proportion Correct)     .82    .40       .98              .79    .37       .98
Fj (Flexilevel Score)       .56    .19       .98              .47    .16       .98
% Items Saved              30.4    .89       .96             24.6    .83       .91
Remainder Score             .94    .35       .72              .93    .16       .66

It was surprising that the relatively crude measure, Sj, performed as well as Fj, which was intended to
be the more sensitive measure. Fj takes into account the difficulty of the item the student takes: passing an
item, i, which is relatively easy results in a relatively small increase in score (1 - Pi), and a larger increase for a
relatively difficult item. Conversely, missing a relatively easy item, i, results in a relatively large decrease in
score (Pi), and a lesser decrease for a relatively hard item. However, for the parameters of the present study,
both measures performed equally well.
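
The item-level scoring rule for Fj described in this paragraph can be sketched as follows. This is only an illustration: it assumes Pi is the proportion of the calibration sample passing item i, the function name `fj_increment` is hypothetical, and the way the report accumulates or rescales these increments into the final Fj is not reproduced here.

```python
def fj_increment(p_i: float, passed: bool) -> float:
    """Change in score for one item whose difficulty (proportion passing) is p_i.

    Passing adds (1 - p_i); missing subtracts p_i.
    """
    if not 0.0 <= p_i <= 1.0:
        raise ValueError("p_i must be a proportion between 0 and 1")
    return (1.0 - p_i) if passed else -p_i


# Two Block IV difficulties taken from Table 8: an easy item (.96) and a hard item (.57).
for p in (0.96, 0.57):
    print(p, round(fj_increment(p, True), 2), round(fj_increment(p, False), 2))
```

Passing the easy item moves the score by .04 and passing the hard one by .43, while missing them costs .96 and .57 respectively, which is the asymmetry the text describes.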

One can notice from Table 9 that the mean remainder score was substantially higher than the
corresponding total score. This was expected, since with relatively easy items, students tended to emerge
from each scale after taking the most difficult item. Therefore, the remaining items tended to be the easiest
items, with an associated higher score. Since the items were relatively uniform in difficulty, Sj or Fj should
have been a good estimator of the remainder score. In fact, the associated correlations were on the order of
.55 across blocks.

Two questions remain to be answered. First, can we accurately classify examinees into mastery or
non-mastery states based on scores (i.e., Sj and Fj) calculated from the smaller item set? Second, was there
any actual time savings associated with the item savings? The data relevant to the first question are reported
in the next section.

Classification Analysis. Regression equations for predicting total score (Xj) from both Sj and Fj were
computed (Equation 2). The predicted scores (X̂j) were then compared to the student’s observed score (Xj),
and the number of correct and incorrect classifications were calculated. For both blocks the course
established criterion of 70 percent was used to define the cutoff. However, using the total score as the
measure of mastery or non-mastery was subject to the same criticism raised in Study I, namely that the
total score is an imperfect measure of the (latent) trait of interest, mastery. The Block II and Block IV
regression equations and classification analyses are presented in Table 10. As can be seen, the prediction of
total score pass-fail from either Sj or Fj in Block II was almost perfect. That is, the predicted score (X̂j)
misclassified only 1.6 percent of the sample.

In Block IV, Fj classified examinees somewhat more accurately than Sj (i.e., 97.2 percent vs. 94.4
percent). However, the errors in classification based on Sj were conservative since they classified students as
failing the block test when they had actually passed.
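
The classification step can be illustrated with a short sketch. It assumes a student passes when the score is at or above the 70 percent criterion (whether the boundary is inclusive is an assumption), and it takes the regression weights and hit-miss counts as read from Table 10. The function names are hypothetical.

```python
CUTOFF = 0.70  # course-established mastery criterion


def predicted_total(a: float, b: float, score: float) -> float:
    """Predicted total score from a flexilevel-portion score (Sj or Fj)."""
    return a + b * score


def classify(score: float, cutoff: float = CUTOFF) -> str:
    """Pass/fail decision; the inclusive boundary is an assumption."""
    return "pass" if score >= cutoff else "fail"


def percent_correct(pass_pass: int, pass_fail: int, fail_pass: int, fail_fail: int) -> float:
    """Percent of examinees whose predicted and observed classifications agree.

    Arguments are (observed, predicted) cell counts of the hit-miss table.
    """
    total = pass_pass + pass_fail + fail_pass + fail_fail
    return 100.0 * (pass_pass + fail_fail) / total


# Block II equation for Sj in Table 10: predicted total = .08 + .94 Sj
print(classify(predicted_total(0.08, 0.94, 0.70)))

print(round(percent_correct(52, 1, 0, 8), 1))    # Block II, Sj
print(round(percent_correct(57, 4, 0, 11), 1))   # Block IV, Sj
print(round(percent_correct(60, 1, 1, 10), 1))   # Block IV, Fj
```

With these counts the three agreement rates come out at 98.4, 94.4 and 97.2 percent, matching the figures quoted in the text.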

Time  Analysis.  Data  were  collected  on  how  long  each  student  took  to  complete  the  flexilevel  portion 
of  the  test  as  well  as  the  amount  of  time  taken  to  complete  the  remainder  of  the  test.  These  times  were 
collected  for  each  scale  in  the  block  tests. 

Table 11 presents the mean times for Blocks II and IV. The time analysis was somewhat
disappointing, since the flexilevel test reduced testing time by only 15 percent and 12 percent, respectively.
The procedure of starting each student at the median item of each scale required a minimum of 27 items
before the flexilevel test was completed. Moreover, as pointed out earlier, those items which were not taken
in the flexilevel portion tend to be the easier items and, thus, answered relatively faster.
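
The time savings quoted here follow directly from the totals in Table 11. A minimal sketch of the arithmetic, using the per-scale flexilevel times (in minutes) and the total test times (in hours) from that table; the function name is hypothetical:

```python
def proportion_time_saved(total_hours: float, flexilevel_hours: float) -> float:
    """Share of total testing time not spent on the flexilevel portion."""
    return (total_hours - flexilevel_hours) / total_hours


# Mean flexilevel minutes per scale, Table 11.
block_ii_flex = [7.56, 5.27, 15.25, 12.10, 12.51]
block_iv_flex = [9.14, 16.25, 4.05, 16.48, 6.83]

# Both blocks sum to roughly .88 hr of flexilevel testing.
print(round(sum(block_ii_flex) / 60, 2), round(sum(block_iv_flex) / 60, 2))

# Total time on test was 1.03 hr (Block II) and 1.00 hr (Block IV).
print(round(proportion_time_saved(1.03, 0.88), 2))
print(round(proportion_time_saved(1.00, 0.88), 2))
```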




Table 10.  Regression Equations and Classification Analysis

                                Block II                      Block IV

Regression Equations
                                X̂j = .08 + .94 Sj             X̂j = .03 + 1.0 Sj
                                X̂j = .49 + .65 Fj             X̂j = .48 + .72 Fj

Hit-Miss Analysis Using Sj
                                   Predicted                     Predicted
                                   Pass    Fail                  Pass    Fail
   Observed   Pass                  52       1                    57       4
              Fail                   0       8                     0      11
   % Correct                          98.4                          94.4

Hit-Miss Analysis Using Fj
                                   Predicted                     Predicted
                                   Pass    Fail                  Pass    Fail
   Observed   Pass                  52       1                    60       1
              Fail                   0       8                     1      10
   % Correct                          98.4                          97.2


Table 11.  Time (in Minutes) to Complete Scales

                        Block II                    Block IV
Scale             Flexilevel   Remainder      Flexilevel   Remainder

1                    7.56        3.13            9.14        1.62
2                    5.27        0.58           16.25        1.62
3                   15.25        2.42            4.05        0.84
4                   12.10        1.03           16.48        2.40
5                   12.51        1.98            6.83         .93
                        N = 55*                     N = 65*

Total Time on Test       =     1.03 hr                     1.00 hr
Flexilevel Time          =      .88 hr                      .88 hr
Proportion Time Saved    =      .15 hr                      .12 hr

*Sample sizes reduced due to occasional computer failure during testing.


Conclusions


The results of the analyses suggest several conclusions about the efficacy of flexilevel testing in an
ongoing training environment. First, the proportion correct during the flexilevel test (Sj) is as effective in
predicting total score as the ostensibly more sensitive flexilevel score (Fj). This fact was reflected in the
correlation between Sj and total score as well as the accuracy of mastery or non-mastery classification. In
addition, Sj has the advantage of being in the metric that is most familiar to both students and instructors.

It was also concluded that the modest time savings (12 to 15 percent) was due to the parameters used
to implement flexilevel testing. That is, entering at the median item requires the administration of at least
27 items before exit from the test. In addition, items not taken during the flexilevel test tended to be
easier, as evidenced by the remainder score, which would tend to decrease the time a student needed to
complete these items. However, it should be pointed out that even a 15 percent time saving applied to the
large number of students in AIS courses will, in the long run, reflect an economically significant time
savings.

Finally, the selection of the parameters for this study led us to speculate about potentially realizable
savings due to alternate flexilevel strategies. The following study was designed to investigate that problem.


IV.  STUDY  III 

Objective 

The results of Study II were obviously contingent on the parameters chosen to implement the study.
For example, examinees always began on the median item of a scale and took all scales. An alternative was
to use the flexilevel algorithm at the scale level as well as the item level; i.e., if a scale is passed, the next
hardest scale is taken, or, if failed, the next easiest is taken, and so on. Study I has shown that the
simulation of the flexilevel algorithm on paper-and-pencil test protocols closely approximates results
obtained during testing via a computer terminal. Therefore, it was decided to simulate, using Study II test
protocols, the effects of adaptive movement across scales on the various dependent measures. In addition to
implementing the flexilevel algorithm across scales, the simulation considered two other variables. First, the
depth or item entry level within a scale was varied similar to Study I. Second, this depth notion was
extended to the scale level by varying the starting scale between the hardest and easiest.

Because of the overlap in item difficulties between the original scales, the items were reordered into
scales based entirely on the difficulty indices obtained in the calibration sample. The scales were formed by
ranking the items according to difficulty and then forming scales with non-overlapping item difficulties.
The position of a scale in the hierarchy was determined by the average difficulty of the scale. Table 12
contains the new scales for the Block II and Block IV tests.

Method 

The 133 test protocols obtained during Study II were used as the data in this study. The simulation
consisted of varying the levels of three parameters and measuring the effects on the various dependent
measures. The three parameters manipulated were: (a) scale pass criterion (SPC), (b) scale start (SS), and (c)
scale entry level (EL). These are defined as follows.

EL was used the same way as in Study I. It defined the item number within each scale where the
flexilevel algorithm was started. EL was varied between 1 and 5. If EL = 1 the hardest item was given first,
and if EL = 5 the 5th hardest item was given first. EL also defined the minimum number of items that had
to be taken before testing within a particular scale was completed. For example, with EL = 1 at least one
item had to be taken; if it was passed, testing was complete for that scale; if failed, at least one more was
taken (the next easiest), and so on.
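
The role of EL can be made concrete with a small simulation of item-level flexileveling on a stored response protocol. This is a sketch under stated assumptions, not the report's program: the stopping rule (testing in a scale ends when a pass leaves no harder untaken item, or a miss leaves no easier untaken item) is inferred from the EL = 1 example, and the function name `administer_scale` is hypothetical.

```python
from typing import List, Tuple


def administer_scale(responses: List[int], entry_level: int) -> List[Tuple[int, int]]:
    """Simulate flexilevel testing within one scale.

    responses[k] is the stored 1/0 response to the (k+1)-th hardest item.
    entry_level is EL: 1 starts at the hardest item, 5 at the 5th hardest.
    Returns the (item index, response) pairs in the order administered.
    """
    n = len(responses)
    if not 1 <= entry_level <= n:
        raise ValueError("entry_level must lie between 1 and the scale length")
    current = entry_level - 1
    harder = current - 1   # index of the next harder item not yet taken
    easier = current + 1   # index of the next easier item not yet taken
    taken = []
    while True:
        response = responses[current]
        taken.append((current, response))
        if response == 1:
            if harder < 0:          # passed with nothing harder left: exit the scale
                break
            current, harder = harder, harder - 1
        else:
            if easier >= n:         # failed with nothing easier left: exit the scale
                break
            current, easier = easier, easier + 1
    return taken


# An examinee who passes everything takes exactly EL items.
print(len(administer_scale([1] * 8, 3)))
# With EL = 1, missing the hardest item leads to the next easiest one.
print(administer_scale([0, 1, 1], 1))
```

Under this rule an examinee who answers every item correctly takes exactly EL items, which is consistent with EL defining the minimum number of items per scale.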




Table 12.  Items Comprising Scales and Difficulties

               Scale 1       Scale 2       Scale 3       Scale 4       Scale 5
               Item  Diff    Item  Diff    Item  Diff    Item  Diff    Item  Diff

Block II        15   .98      29   .94       5   .90      18   .87      35   .77
                11   .97      21   .94      37   .90       8   .86       2   .75
                24   .97      12   .94       3   .90      19   .85      23   .74
                14   .96      31   .94      17   .89      32   .85      13   .72
                 9   .96      16   .93      26   .89      38   .84      28   .70
                10   .96      36   .93      39   .88      22   .84       4   .70
                 6   .95       7   .92      25   .88      27   .81      33   .63
                34   .95      20   .92                    40   .81      30   .51
                 1   .95
Mean Diff            .96           .93           .89           .84           .69

Block IV        15  1.00       1   .96       5   .88      27   .83      36   .69
                16  1.00       8   .96      11   .88      20   .82       3   .67
                18  1.00      21   .96      37   .88      22   .82      35   .62
                29  1.00      38   .96      39   .88      32   .82       7   .61
                26   .99       4   .95      34   .87      12   .81       6   .58
                24   .98      25   .94      19   .86      28   .81       9   .58
                31   .98       2   .92      14   .85      30   .72      40   .57
                23   .97      10   .90      13   .84      17   .70      33   .70
Mean Diff            .99           .94           .87           .78           .62


SS defined the scale within which testing was started, and, thus, took the values 1 through 5. If SS =
5 (the hardest scale) were passed or if SS = 1 (the easiest scale) were failed, only one scale need be taken,
i.e., testing was complete.

When flexileveling at the item level, the 1-0 item score was used to define the next item to be given;
i.e., a 1 implied a harder item and a 0 an easier one. In a real sense, this was the criterion for movement
between items. In a similar vein, a criterion for movement between scales was needed. This was complicated
by variable entry (EL), since EL = 1 implied possible scale scores of 1.0, .50, .33 and so on, whereas other
values of EL implied other ranges of scale scores. Therefore, SPC was operationalized in the not wholly
satisfactory sense of how many items were missed. Thus, SPC was varied between 0 and 3 where a
particular value defines the maximum number of items that could be missed in order to pass the scale.
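
Movement between scales can be sketched in the same way as movement between items. The following is an illustration only: it assumes the scale-level rule mirrors the item-level one (a passed scale leads to the next harder untaken scale, a failed scale to the next easier untaken one, and testing stops when no such scale remains), and it takes the number of items missed in each scale as given. The names `scale_passed` and `scales_taken` are hypothetical.

```python
from typing import Dict, List


def scale_passed(items_missed: int, spc: int) -> bool:
    """A scale is passed when no more than SPC items were missed."""
    return items_missed <= spc


def scales_taken(missed_by_scale: Dict[int, int], ss: int, spc: int) -> List[int]:
    """Order in which scales are administered.

    missed_by_scale maps scale number (1 = easiest ... 5 = hardest) to the
    number of items the examinee would miss if that scale were taken.
    ss is the starting scale and spc the scale pass criterion.
    """
    lowest, highest = min(missed_by_scale), max(missed_by_scale)
    if not lowest <= ss <= highest:
        raise ValueError("ss must be one of the available scales")
    current, easier, harder = ss, ss - 1, ss + 1
    order = []
    while True:
        order.append(current)
        if scale_passed(missed_by_scale[current], spc):
            if harder > highest:    # passed with no harder scale left
                break
            current, harder = harder, harder + 1
        else:
            if easier < lowest:     # failed with no easier scale left
                break
            current, easier = easier, easier - 1
    return order


no_misses = {1: 0, 2: 0, 3: 0, 4: 0, 5: 0}
print(scales_taken(no_misses, ss=5, spc=0))   # passing the hardest scale ends testing
print(scales_taken(no_misses, ss=3, spc=0))   # starting in the middle needs three scales
```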

The assumption of item independence, which was important in Study I, was also relevant in this
study: namely, that a subject taking a particular item in a different order would give the same response as
he gave in the original order. To the extent that this assumption is true, the results presented below
reflect potentially obtainable outcomes from a variety of flexilevel strategies.

Results  and  Discussion 

Simulations. The computer simulation was used to generate the values of various dependent variables
for all possible combinations of the three parameters for both Block II and Block IV. The dependent
variables were: (a) percent items saved, (b) percent classified correctly by Sj, (c) percent classified correctly
by Fj, and (d) correlations with total score for Sj and Fj.




Table 13 presents the results of the simulation runs for Block II. Similar to Study I, EL strongly
affects the dependent measures. Since EL implied the minimum number of items a student must take, the
percent of items saved (% saved) varied inversely with this parameter (i.e., maximum items saved with
minimum EL). Also, as EL increased, the predictiveness of S and F increased. This also was expected, since
as EL increased, the item composite upon which S and F was based increased in size and thus reliability.
Finally, as predictiveness increases, the percent of examinees correctly classified would be expected to
increase, as it in fact does.


Table 13.  Simulation Results for Block II

                                                          Correlations
Parameter   Value   % Saved   Class (S)   Class (F)     rS,T      rF,T

SPC*          0        67       .933        .932        .829      .840
              1        67       .942        .945        .833      .854
              2        68       .946        .948        .834      .851
              3        69       .919        .942        .830      .845

SS*           1        63       .933        .936        .851      .872
              2        61       .949        .948        .877      .893
              3        66       .948        .953        .859      .869
              4        71       .937        .952        .819      .829
              5        80       .908        .921        .753      .774

EL*           1        88       .884        .883        .674      .691
              2        77       .925        .934        .817      .833
              3        66       .954        .966        .861      .877
              4        58       .961        .966        .896      .911
              5        50       .949        .961        .911      .925

*Averaged over the values of the other two parameters.


The striking aspect of Table 13 was the very large savings in items obtainable with various flexilevel
strategies; this was particularly dramatic for EL. At EL = 1 only 12 percent of the items were required to
correctly classify nearly 90 percent of the testees. At EL = 2, only 23 percent of the original items were
required to classify over 90 percent. This is in contrast to the Study II strategy, which saved 30 percent in
Block II and 25 percent in Block IV, while correctly classifying 98 percent and 96 percent, respectively. It
was apparent that for only a modest decrease in correct classification, an enormous increase in test items
saved could be realized. If the relationship between items saved and time saved found in Study I were
extrapolated to the present results, a 36 percent savings in test time could be realized at EL = 2.

The relationship of the other parameters to the dependent measures was less clear. SS would be
expected to introduce a bow-shaped effect on the dependent variables, since, similar to EL, SS implies the
minimum number of scales which must be taken to complete testing: SS = 3 implies at least three scales,
SS = 2 or 4 implies at least 2 and SS = 1 or 5 implies at least one. This effect can be seen to some extent in
the classification functions and validities, which increase to SS = 2 or 3, and then decrease. Turning to SPC, there
was little to choose from in terms of an optimal value. The results for SPC were perhaps idiosyncratic to the
generally easy nature of the test items; i.e., varying SPC had minimal implications for all but the least
prepared student.

Table 14 presents the simulation results for the Block IV test. Again, EL had the strongest effect on
each dependent variable. Indeed the pattern for Block IV was much the same as the pattern reported for
Block II. Looking across these blocks’ results suggested that generally optimum values for the parameters
were SPC = 2, SS = 3, and EL = 3.




Table 14.  Simulation Results for Block IV

                                                          Correlations
Parameter   Value   % Saved   Class (S)   Class (F)     rS,T      rF,T

SPC*          0        66       .895        .911        .809      .82
              1        66       .888        .919        .814      .843
              2        69       .886        .915        .818      .847
              3        69       .884        .900        .809      .83

SS*           1        63       .887        .908        .823      .858
              2        60       .906        .926        .862      .883
              3        63       .906        .926        .846      .861
              4        69       .894        .917        .812      .829
              5        79       .848        .878        .721      .749

EL*           1        88       .853        .862        .639      .656
              2        77       .898        .910        .820      .842
              3        65       .895        .927        .856      .882
              4        56       .895        .925        .868      .897
              5        49       .899        .931        .879      .904

*Averaged over the values of the other two parameters.


Table 15 presents the values of the dependent variables for the Block II and Block IV simulations
using the parameter values indicated previously. These results indicated that, using approximately 48
percent of the items, examinees were classified perfectly in Block II, and about 93 percent were classified
correctly in Block IV. The correlations of both S and F with the total score were also quite high. This
suggested that total score could be predicted very accurately from either score (a fact brought out by the
classification data).

Table 15.  Simulation Results: SPC = 2, SS = 3, EL = 3

                                                   Correlations
              % Saved   Class (S)   Class (F)     rS,T      rF,T

Block II         54       1.00        1.00         .94       .95
Block IV         51        .93         .94         .91       .93

Since the simulation results for the two block tests were so similar, a series of regression analyses were
run in order to test the generalizability of the results across blocks. Basically, the results presented to this
point addressed the question of what kinds of item savings and classification accuracy could be expected by
various flexilevel strategies. The overriding question, however, was the extent to which these results
generalize from block to block. If the simulation results, or for that matter the empirical results from Study
II, are block specific (i.e., content or item characteristic specific), then they are of little value in forecasting
what would happen in a new block. If the results show generalizability across the two blocks of PME, there
is evidence that implementing a particular flexilevel testing strategy in a new block, or even a new course
with similar item characteristics, would yield similar outcomes.

Regression Analyses. The predictability of the dependent variables generated in the simulation studies
was assessed in a number of regression analyses. The predictor variables were the original parameters SS,
SPC and EL, plus certain nonlinear predictors. These latter predictors were derived from the original three
parameters and were EL*, EL x SPC, ln(EL), ln(EL) x SS, EL x SS, SS*, SS* and |SS - 3|.
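
As an illustration of how such predictors can be built, the sketch below generates the derived terms for every combination of the three parameters over the ranges used in the simulation (SPC 0 to 3, SS 1 to 5, EL 1 to 5). The power terms are left out because their exponents are not legible in the source; the function name and dictionary keys are hypothetical.

```python
import math
from itertools import product
from typing import Dict


def derived_predictors(spc: int, ss: int, el: int) -> Dict[str, float]:
    """Original parameters plus the product, log and absolute-value terms."""
    return {
        "SPC": spc,
        "SS": ss,
        "EL": el,
        "EL x SPC": el * spc,
        "ln(EL)": math.log(el),
        "ln(EL) x SS": math.log(el) * ss,
        "EL x SS": el * ss,
        "|SS - 3|": abs(ss - 3),
    }


# One row of predictors per simulated strategy.
grid = [
    derived_predictors(spc, ss, el)
    for spc, ss, el in product(range(0, 4), range(1, 6), range(1, 6))
]
print(len(grid))
print(derived_predictors(2, 3, 3))
```

The |SS - 3| term is zero at the middle starting scale and grows toward either extreme, which is one way of capturing the bow-shaped effect of SS.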


The inclusion of these derived variables produced a total of nine predictors. Regression runs were
done separately by block for: (a) percent items saved, (b) percent correctly classified by S (Class (S)), (c)
percent correctly classified by F (Class (F)), (d) the constant term for predicting the total score from the S
score (aS), (e) the b weight for predicting total score from the S score (bS), (f) the constant for predicting
the total score from the F score (aF), and (g) the b weight for predicting the total score from the F score
(bF). In addition, regression analyses were run after combining Block II and Block IV data.

The Block II, Block IV, and total regression analyses for the percent items saved criterion are
presented in Table 16. As can be seen, percent items saved was highly predictable in all three analyses, with
EL having the greatest weight. This was consistent with the results presented in Tables 13 and 14, and was
highly consistent across blocks, as well as when the block data were pooled. This was reflected in the
multiple correlations for each analysis as well as the consistency of the beta weights across the three
analyses.

Table 16.  Regression Analyses
for % Items Saved

               Block II        Block IV        Total
Variable       Beta weights    Beta weights    Beta weights

EL               -1.35           -1.35           -1.34
SS*                .24             .24             .24
EL x SS            .35             .22             .28
|SS - 3|           .10             .11             .11
EL*                .21             .28             .25
EL x SPC           .07             .11             .09

Multiple R         .96             .93             .94
R Square           .92             .87             .89

Table 17 presents the three regression analyses using Class (S) as the criterion. These analyses were not
as consistent as those reported in Table 16. Again, EL or some transformation of EL was the most
important variable in predicting the classification power of Sj. Class (S) was not as predictable as the percent
items saved, as evidenced by the relatively low multiple R’s in comparison to those reported in Table 16.
Furthermore, a different set of predictors was defined for each analysis; however, the relative ranking of the
common predictors (viz, EL, SS*, and EL*) was the same across all three analyses.


Table 17.  Regression Analyses
for Predicting Class (S)


Variable       Block II        Block IV        Total
               Beta weights    Beta weights    Beta weights

EL 

134 

1.75 

1.29 

SS* 

-.63 

-36 

-.26 

EL* 

-1.33 

-.67 

-.94 

ISS-31 

-.28 

-.21 

ln(EL) 

-.83 

ELxSS 

.68 

SPC 

-.10 

Multiple  R 

.81 

.68 

.60 

R Square 

.66 

.46 

36 



Table 18 presents the analyses using Class (F) as the criterion variable. Again, EL was the variable
which had the largest beta weight, followed by EL*. In this analysis the four common variables retained
their relative importance across the blocks and the total sample analyses. The multiple correlations were
somewhat higher than those obtained in the Class (S) analysis, especially for the total sample.


Table 18.  Regression Analyses
for Predicting Class (F)


Variable       Block II        Block IV        Total
               Beta weights    Beta weights    Beta weights

EL 

i.42 

1.45 

1.39 

EL* 

-1.48 

-.99 

-1.16 

ISS-31 

-.09 

-.24 

-.16 

EL  X SPC 

.17 

SS* 

-.49 

-.26 

-.36 

ELxSS 

.69 

.33 

Multiple  R 

.87 

.70 

.71 

R Square 

.75 

.49 

.51 

Table 19 contains analyses for aS. As evidenced by the multiple R’s, aS is almost perfectly
predictable. As was the case in all other analyses, EL was the most predictive.


Table 19.  Regression Analyses
Predicting aS


Variable       Block II        Block IV        Total
               Beta weights    Beta weights    Beta weights

EL 

-1.46 

-1.59 

-1.83 

SS* 

.47 

.33 

.35 

EL  X SPC 

.16 

.30 

.24 

ln(EL)xSS 

.25 

-.24 

EL* 

.41 

.12 

.25 

ISS-31 

-.06 

-.06 

-.05 

SPC 

.11 

.08 

.09 

ELxSS 

.44 

.60 

In(EL) 

.20 

.43 

Multiple  R 

.99 

.99 

.98 

R Square 

.99 

.98 

.96 

Table 20 contains the regression analyses using bS as the criterion. Again the most important
predictor in all three analyses was EL. The prediction for all three analyses was essentially perfect, with the
lowest multiple correlation coefficient being .98. This suggests that the b weight for the S score in
predicting the total score was highly predictable from the three parameters studied in the simulations. The
results of the aS analyses also suggest that the constant term in the regression equation is highly predictable
(see Table 19).


Table 20.  Regression Analyses
Predicting bS


Variable       Block II        Block IV        Total
               Beta weights    Beta weights    Beta weights

EL 

2.18 

1.41 

1.90 

SS* 

-.40 

-.38 

-.37 

ELxSPC 

-.14 

-.25 

-.20 

ln(EL)xSS 

.38 

-.28 

.20 

EL» 

-.52 

-.24 

-.36 

ISS-31 

.06 

.03 

ln(EL) 

-.6 

-.45 

ELxSS 

-.59 

-.45 

Multiple  R 

.99 

.985 

.98 

R Square 

.99 

.97 

.97 

Tables 21 and 22 present the results for aF and bF, respectively. While aF was highly predictable
within blocks, it was not so predictable in the pooled sample. This suggested that the regressions were not
homogeneous and, therefore, the parameter was specific to test content or item characteristics. In contrast,
the bF analysis (Table 22) shows remarkable consistency, both across blocks and when pooled.


Table 21.  Regression Analyses
Predicting aF


Variable       Block II        Block IV        Total
               Beta weights    Beta weights    Beta weights


SS* 

-.99 

-.82 

-.40 

EL 

.57 

.52 

.13 

ISS-31 

.22 

.31 

.12 

lii(EL)xSS 

EL» 

-.56 

.13 

ln(EL) 

-.20 

Multiple  R 

.97 

.96 

.40 

R Square 

.94 

.92 

.16 

Table 22.  Regression Analyses for bF


Variable       Block II        Block IV        Total
               Beta weights    Beta weights    Beta weights

EL 

1.94 

1.63 

1.84 

EL* 

-.59 

-37 

-.49 

ELxSPC 

-.08 

-.18 

-.09 

SPC 

.08 

ln(EL) 

-.58 

-.33 

-.54 

ln(EL)xSS 

.30 

-.16 

.17 

ELxSS 

-.41 

-.31 

SS* 

-.62 

-.59 

-.59 

ISS-31 

.08 

.04 

Multiple  R 

.993 

.978 

.983 

R Square 

.987 

.956 

.966 

Cross-Validation. To assess the generality of the regression equations, they were cross-validated by
using the Block II and total weights to predict in Block IV. Similarly, the weights obtained in the Block IV
and total analyses were cross-validated against Block II data. The results of the cross-validation analyses are
presented in Table 23. As can be seen, very little shrinkage occurred for the majority of the variables under
study. The greatest shrinkage occurred for aF when either block was validated against the other.




Table 23.  Multiple Correlations Obtained in Cross-Validation Study

                           Block II                                Block IV
               Original      CV4          CVT          Original      CV2          CVT
Criterion      Multiple R    Multiple R   Multiple R   Multiple R    Multiple R   Multiple R

% Saved           .96          .96          .96           .93          .93          .93
Class (S)         .81          .67          .77           .68          .54          .63
Class (F)         .88          .79          .84           .70          .60          .68
aS               1.00          .98          .99          1.00          .98          .99
bS               1.00          .98          .99           .99          .98          .98
aF                .97          .62          .85           .96          .63          .88
bF                .99          .97          .99           .98          .97          .98


Conclusions and Recommendations

Study III has shown that large savings in items, and, potentially, test time, can be realized through the
implementation of alternate flexilevel strategies. The conservative strategy adopted in Study II resulted in
only modest item and time savings. However, even these modest savings can result in significant dollar
savings when amortized over the thousands of technical training students for just one year. Study III has
shown that significantly greater savings can be realized with more efficient procedures, in the form of
optimal values for SPC, SS, and EL.

More important, the cross-validation results of Table 23 suggest that for testing situations similar to
those studied here, a course designer can plan the implementation of flexilevel testing with considerable
accuracy. After making a determination of the appropriate strategy, i.e., selecting levels of SPC, SS, and EL,
the planner can estimate the amount of item savings which will occur and can make initial estimates of total
score. The latter can be accomplished by substituting the selected parameters into any of the equations in
Table 19 to obtain a, then any of the equations in Table 20 to obtain b. These weights, then, are the
regression parameters to convert Sj, the percent of items correct, to an estimated total percent correct
score.

Table 23 offers some evidence that those parameters discussed above, percent items saved, aS and bS, are
relatively independent of block content, since the equations are virtually interchangeable between the
blocks studied in this research and highly predictive as well. With the exception of bF, the other dependent
measures in Table 23 are predictable in varying degrees, often with significant shrinkage. As mentioned
earlier, shrinkage is an indication that the outcome measures are more a function of the test content or item
characteristics, than of the testing strategy. In these instances it is important to develop estimates which are
specific to the particular testing situation. In any event, caution would dictate that any newly implemented
flexilevel testing program be validated to determine its efficiency.

The overall conclusion from the three studies would seem to be that flexilevel testing, with variable
entry, offers an easily implemented testing procedure with potential for significant dollar savings at minimal
risk (in the sense of misclassification). Studies I and III, the simulation studies, show the potential power of
implementing alternate strategies and the great regularity of the data obtained.

Study I indicated the viability of simulating flexilevel testing on paper-and-pencil protocols to
determine optimal entry levels as well as potential item savings. This type of simulation can be
accomplished prior to actual implementation, or the results from Study III can be used directly to guide the
selection of an optimal flexilevel strategy.

In any event, it would seem appropriate to implement further flexilevel testing in technical training
where the availability of computer terminals permits. Since, for example, in the Advanced Instructional
System, students spend 30 to 40 percent of their time in testing activities, it can be seen that significant
training time reductions are potentially obtainable.




REFERENCES


Bryson, R. Shortening tests: Effects of method used, length and internal consistency on correlation with
total score. Proceedings of the 80th Annual Convention of the American Psychological Association,
1972, 7-8.

Cleary, T.A., Linn, R.L., & Rock, D.A. An exploratory study of programmed tests. Educational and
Psychological Measurement, 1968, 28, 345-360. (a)

Cleary, T.A., Linn, R.L., & Rock, D.A. Reproduction of total test score through the use of sequential
programmed tests. Journal of Educational Measurement, 1968, 5, 183-187. (b)

Hansen, D.N., Harris, D.A., & Ross, S. Flexilevel adaptive testing paradigm: Validation in technical training.
AFHRL-TR-77-35(I), AD-A042 977. Lowry AFB, CO: Technical Training Division, Air Force Human
Resources Laboratory, July 1977. (a)

Hansen, D.N., Harris, D.A., & Ross, S. Flexilevel adaptive testing paradigm: Hierarchical concept structures.
AFHRL-TR-77-35(II), AD-A042 966. Lowry AFB, CO: Technical Training Division, Air Force
Human Resources Laboratory, July 1977. (b)

Hansen, D.N., Johnson, B.F., Fagan, R.L., Tam, P., & Dick, W. Computer-based adaptive testing models for
the Air Force technical training environment Phase I: Development of a computerized measurement
system for Air Force technical training. AFHRL-TR-74-48, AD-785 142. Lowry AFB, CO: Technical
Training Division, Air Force Human Resources Laboratory, July 1974.

Linn, R.L., Rock, D.A., & Cleary, T.A. Sequential testing for dichotomous decisions. College Entrance
Examination Board Research and Development Report. RDR 60-80, No. 3, 1970 (ETS, Rb-70, 31).

Lord, F.M. The self-scoring flexilevel test. Journal of Educational Measurement, 1971, 8, 147-151. (a)

Lord, F.M. A theoretical study of the measurement effectiveness of flexilevel tests. Educational and
Psychological Measurement, 1971, 31, 808-813. (b)

Patterson, J.J. An evaluation of the sequential method of psychological testing. Unpublished doctoral
dissertation, Michigan State University, 1962.




